Take-home Exercise 03
1. Overview
1.1 Background
FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing-related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.
FishEye analysts have attempted to use traditional node-link visualizations and standard graph analyses, but these were found to be ineffective because the scale and detail in the data can obscure a business’s true structure. Can you help FishEye develop a new visual analytics approach to better understand fishing business anomalies?
1.2 The Task
1.3 Reference
This exercise is based on Mini-Challenge 3 of the VAST Challenge 2023.
2. The Data
2.1 Getting Started
Install and load the R packages needed for data preparation, data wrangling, data analysis and visualisation using the code chunk below.
pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce, tidytext, tidyverse, skimr)
2.2 Data Import
The code chunk below imports the data into the R environment using fromJSON() from the jsonlite package.
mc3_data <- fromJSON("data/MC3.json")
2.3 Data Wrangling
2.3.1 The edges data
The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.
mc3_edges <- as_tibble(mc3_data$links) %>%
  # drop exact duplicate rows
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  # count the rows for each (source, target, type) combination
  group_by(source, target, type) %>%
  summarise(weights = n()) %>%
  # remove self-loops
  filter(source != target) %>%
  ungroup()
2.3.2 The nodes data
The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.
mc3_nodes <- as_tibble(mc3_data$nodes) %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  # keep the fields of interest, with id first
  select(id, country, type, revenue_omu, product_services)
Instead of using the nodes data table extracted from mc3_data, we will prepare a new nodes data table from the source and target fields of the mc3_edges data table. This ensures that the nodes table includes every source and target value that appears in the edges.
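The reason for rebuilding the nodes table can be illustrated on toy data. The sketch below is only an illustration: toy_edges, toy_nodes, edge_ids, missing_ids and toy_nodes1 are hypothetical names and values, not taken from MC3.json. It shows an edge endpoint that has no row in the nodes table, and how joining from the edge ids keeps that endpoint.

```r
library(dplyr)

# Toy data (hypothetical): endpoint "C" appears in the edges
# but has no row in the nodes table
toy_edges <- tibble(source = c("A", "B"), target = c("B", "C"))
toy_nodes <- tibble(id = c("A", "B"), type = c("Company", "Company"))

# Collect every id that appears as a source or a target
edge_ids <- tibble(id = c(toy_edges$source, toy_edges$target)) %>%
  distinct()

# Ids referenced by the edges but absent from the nodes table
missing_ids <- anti_join(edge_ids, toy_nodes, by = "id")

# Rebuilt nodes table: one row per edge endpoint,
# with attributes filled in where they are known and NA otherwise
toy_nodes1 <- left_join(edge_ids, toy_nodes, by = "id")
```

Here missing_ids contains the single id "C", and toy_nodes1 keeps "C" with a missing type. The code chunk below applies the same idea to mc3_edges.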
id1 <- mc3_edges %>%
  select(source) %>%
  rename(id = source)
id2 <- mc3_edges %>%
  select(target) %>%
  rename(id = target)
mc3_nodes1 <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(mc3_nodes, by = "id", unmatched = "drop")
2.3.3 EDA
Exploring the edges data
Display summary statistics of the mc3_edges tibble using skim() from the skimr package, as in the code chunk below.
skim(mc3_edges)
| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The report above shows that mc3_edges has no missing values.
Display the mc3_edges tibble as an interactive table in the HTML document using datatable() from the DT package, as in the code chunk below.
DT::datatable(mc3_edges)
ggplot(data = mc3_edges, aes(x = type)) +
  geom_bar()
Exploring the nodes data
Display summary statistics of the mc3_nodes tibble using skim() from the skimr package, as in the code chunk below.
skim(mc3_nodes)
| Name | mc3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
The report above shows no missing values in the character columns, but revenue_omu is missing for 21,515 of the 27,622 rows (complete rate 0.22).
Display the mc3_nodes tibble as an interactive table in the HTML document using datatable() from the DT package, as in the code chunk below.
DT::datatable(mc3_nodes)
ggplot(data = mc3_nodes, aes(x = type)) +
  geom_bar()
3. Visualisation and Analysis
3.1 Initial Network Visualisation
3.1.1 Network model with tidygraph
mc3_graph <- tbl_graph(nodes = mc3_nodes1,
                       edges = mc3_edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
mc3_graph %>%
  # keep only the nodes with high betweenness centrality
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  # constant alpha and colour are set outside aes()
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue", alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph()
3.1.2 Text sensing with tidytext
Simple word count
Count the number of times the word “fish” appears in the product_services field using the code chunk below. Note that str_count() is case-sensitive, so “Fish” is not counted.
mc3_nodes %>%
  mutate(n_fish = str_count(product_services, "fish")) %>%
  arrange(desc(n_fish))
# A tibble: 27,622 × 6
| | id | country | type | revenue_omu | product_services | n_fish |
|---|---|---|---|---|---|---|
| | <chr> | <chr> | <chr> | <dbl> | <chr> | <int> |
| 1 | Taylor LLC | ZH | Comp… | 138982. | Fish (anchovy, … | 11 |
| 2 | Gvardeysk Sextant ОАО Cargo | Uziland | Comp… | 73027. | Fish salads (It… | 11 |
| 3 | Mclaughlin, Valdez and Moo… | ZH | Comp… | 6154. | Fresh, cured, o… | 8 |
| 4 | Murphy, Fisher and Barnes | ZH | Comp… | 256739. | Frozen fish blo… | 7 |
| 5 | Garcia, Lloyd and Houston | ZH | Comp… | 53509. | Bottom fishes; … | 7 |
| 6 | SeaSelect Foods Salt spray | Marebak | Comp… | 41902. | European whole … | 7 |
| 7 | Ancla del Este Sagl | Oceanus | Comp… | 16167. | Bottom fishes, … | 7 |
| 8 | Monroe, Smith and Miller | ZH | Comp… | 250255. | Frozen processe… | 6 |
| 9 | Arunachal Pradesh s S.A. d… | Marebak | Comp… | 60346. | Offers a wide r… | 6 |
| 10 | suō yú Ltd. Liability Co | Coralm… | Comp… | 31567. | Offers a wide r… | 6 |
# ℹ 27,612 more rows
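Because the pattern "fish" is matched case-sensitively, capitalised occurrences are not included in n_fish. A minimal sketch of a case-insensitive count, using hypothetical toy strings rather than the MC3 data, could look like this:

```r
library(stringr)

# Toy strings (hypothetical), not taken from MC3.json
toy_text <- c("Fish and fish products",
              "Frozen FISH fillets",
              "Canned vegetables")

# Case-sensitive count: only lower-case "fish" matches
n_sensitive <- str_count(toy_text, "fish")

# Case-insensitive count: regex() with ignore_case = TRUE
# also matches "Fish" and "FISH"
n_insensitive <- str_count(toy_text, regex("fish", ignore_case = TRUE))
```

On these strings n_sensitive is 1, 0, 0 and n_insensitive is 2, 1, 0. Whether capitalised matches should be counted depends on the analysis; the same regex() pattern could be passed to str_count() in the mutate() call on mc3_nodes.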
Tokenisation
The code chunk below uses unnest_tokens() from the tidytext package to split the text in the product_services field into words.
token_nodes <- mc3_nodes %>%
  unnest_tokens(word, product_services)
Now we can visualise the words extracted by using the code chunk below.

token_nodes %>%
  count(word, sort = TRUE) %>%
  # keep the 15 most frequent words
  slice_max(n, n = 15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # x is the word and y is its count; coord_flip() only swaps their positions
  labs(x = "Unique words", y = "Count", title = "Count of unique words found in product_services field")
Notice that the top 15 most frequent words contain a few stop words, e.g. “and” and “of”.
Removing stopwords
We will use the stop_words data set from the tidytext package to clean up stop words.
stopwords_removed <- token_nodes %>%
  anti_join(stop_words, by = "word")
- The anti_join() from the dplyr package is used to remove all stop words.
Then we can visualise the words extracted using the code chunk below.

stopwords_removed %>%
  count(word, sort = TRUE) %>%
  # keep the 15 most frequent words
  slice_max(n, n = 15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # x is the word and y is its count; coord_flip() only swaps their positions
  labs(x = "Unique words", y = "Count", title = "Count of unique words found in product_services field")
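The tokenise-then-filter steps above can be traced on a small example. The sketch below uses a hypothetical two-row tibble (toy, toy_tokens and toy_clean are illustrative names, not part of the exercise data) to show what unnest_tokens() and anti_join() each do.

```r
library(dplyr)
library(tidytext)

# Toy data (hypothetical), not taken from MC3.json
toy <- tibble(
  id = c("a", "b"),
  product_services = c("Fish and seafood products",
                       "Processing of frozen fish")
)

# One row per word; unnest_tokens() lower-cases the text by default
toy_tokens <- toy %>%
  unnest_tokens(word, product_services)

# Drop every row whose word appears in the stop_words data set
toy_clean <- toy_tokens %>%
  anti_join(stop_words, by = "word")
```

toy_tokens has one row per word (eight in total), while toy_clean no longer contains “and” or “of” but still contains “fish”.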